
Conversation

Collaborator

@brian-dellabetta brian-dellabetta commented Oct 20, 2025

SUMMARY:
Upgrade the lm_eval vision language tests from Qwen 2.5 to Qwen 3. After updating the configs to include apply_chat_template, the scores closely align with what was achieved with Qwen 2.5.

  • switch to the neuralmagic/calibration dataset, based on a suggestion here, to avoid tracing issues related to the VL dataset (a rough sketch of this step follows the list).
  • switch to the chartqa task, to increase the number of samples to 500 and reduce variance in accuracy.
  • prune unused datasets (slimorca and llm_compression_calibration)
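
For context, a rough sketch of what the calibration step looks like with the text-only dataset. This is illustrative only, not the test harness: the model path, recipe name, dataset subset/split, and sample count below are placeholders.

```python
# Rough sketch (placeholders throughout): oneshot calibration against the
# text-only neuralmagic/calibration dataset, which sidesteps tracing a
# vision-language data pipeline.
from datasets import load_dataset
from llmcompressor import oneshot  # older releases expose this via llmcompressor.transformers

# The subset/split arguments are assumptions; the real configs define their own.
calibration_ds = load_dataset("neuralmagic/calibration", split="train")

oneshot(
    model="<qwen3-vl-model-or-local-path>",   # model under test (placeholder)
    dataset=calibration_ds,
    recipe="<quantization-recipe.yaml>",      # e.g. FP8-dynamic, INT8 W8A8, or W4A16
    max_seq_length=2048,
    num_calibration_samples=512,
)
```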

TEST PLAN:
The 3 lm_eval VL tests were run and the accuracies were updated; an illustrative invocation is sketched after the list below.

  • vl_fp8_dynamic_per_token.yaml runs in ~29m
  • vl_int8_w8a8_dynamic_per_token.yaml runs in ~37m
  • vl_w4a16_actorder_weight.yaml runs in ~34m
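
For reference, one of these evaluations could be reproduced through lm_eval's Python API along the lines below. The backend name, model path, and batch size are assumptions, not values taken from the test configs.

```python
# Illustrative only: evaluate a compressed Qwen3 VL checkpoint on chartqa with
# the chat template applied. Backend, model path, and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",                        # assumed multimodal backend
    model_args="pretrained=<path-to-compressed-model>",
    tasks=["chartqa"],
    limit=500,                                    # number of eval samples
    batch_size=8,
    apply_chat_template=True,                     # scores align with Qwen 2.5 only with this set
)
print(results["results"]["chartqa"])
```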

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Collaborator

@dsikka dsikka left a comment


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

@brian-dellabetta
Collaborator Author

Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would add probably ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours (~40 min × 2 runs × 3 configs) with that change

Collaborator

dsikka commented Oct 20, 2025


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would add probably ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours (~40 min × 2 runs × 3 configs) with that change

The 30 datapoints have proven to be very noisy historically. A happy medium might be better, but we should also just validate the runtime for a batch size of 100
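
Something along these lines (placeholders throughout) could be used to sanity-check the runtime at the proposed sample count and batch size before committing to it:

```python
# Quick, illustrative runtime check: time a single chartqa eval at the proposed
# settings. Backend and model path are placeholders.
import time

import lm_eval

start = time.time()
lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=<path-to-compressed-model>",
    tasks=["chartqa"],
    limit=500,          # proposed sample count
    batch_size=100,     # proposed batch size to validate
    apply_chat_template=True,
)
print(f"elapsed: {(time.time() - start) / 60:.1f} min")
```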

@brian-dellabetta brian-dellabetta force-pushed the bdellabe/qwen3-vl-lmeval branch from ea00c16 to 57e50b1 on October 21, 2025
@brian-dellabetta brian-dellabetta marked this pull request as ready for review October 22, 2025 20:19
kylesayrs previously approved these changes Oct 24, 2025
Collaborator

@kylesayrs kylesayrs left a comment


Woop

Collaborator

@dsikka dsikka left a comment


LGTM. 3 questions:

  1. Do we need to keep all the datasets in testing_utils? Are there some that we can remove?
  2. Is 100 enough?
  3. Did we mention to MLR the variation we see without the chat_template?

@brian-dellabetta
Collaborator Author

LGTM. 3 questions:

1. Do we need to keep all the datasets in testing_utils? Are there some that we can remove?

2. Is 100 enough?

3. Did we mention to MLR the variation we see without the chat_template?

@dsikka thanks, see responses below:

  1. We could prune the code in testing_utils. Should I get rid of gsm8k, open-platypus and slim-orca?
  2. I can up this as well. I set it to 100 because tests were taking forever, but I think the CPU of the cluster was just under heavy load when I was trying. 500?
  3. I can do so Monday.

rahul-tuli previously approved these changes Oct 27, 2025
Collaborator

@rahul-tuli rahul-tuli left a comment


LGTM!

Collaborator

dsikka commented Oct 28, 2025

LGTM. 3 questions:

1. Do we need to keep all the datasets in testing_utils? Are there some that we can remove?

2. Is 100 enough?

3. Did we mention to MLR the variation we see without the chat_template?

@dsikka thanks, see responses below:

  1. We could prune the code in testing_utils. Should I get rid of gsm8k, open-platypus and slim-orca?
  2. I can up this as well. I set it to 100 because tests were taking forever, but I think the CPU of the cluster was just under heavy load when I was trying. 500?
  3. I can do so Monday.

Yeah, I think we should up it to 500 and remove any testing dataset that we're not using.

@brian-dellabetta brian-dellabetta added the ready label (When a PR is ready for review) on Oct 28, 2025